Monitoring Server Configuration

This describes the process of building a new monitoring server from scratch. It was modeled on the HamWAN server running in our Fremont site, but built on a newer-vintage VM and with more modern software versions. Rather than Cacti, we use Zabbix as the monitoring engine.

Added Debian bookworm net-install ISO to proxmox server

Create a VM, and install Debian bookworm:

  1. 16GB disk with LVM guided install with separate home, var, and tmp
  2. 100GB disk unconfigured for monitoring data

Boot and configure as a minimal server using the graphical interface:

hostname: monitoring
domain: ziply.hamwan.net
(root password in Vaultwarden)

apt install mg sudo pylint python3-virtualenv strace locate mtr rsyslog postfix redis-server redis-tools
updatedb # update the locate database
change /etc/ssh/sshd_config to move server to port 222
/sbin/groupadd hamadmin
add /etc/sudoers.d/hamadmin (copied from monitoring.hamwan.net)
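The sshd_config change can also be made non-interactively. This is a minimal sketch, assuming the stock Debian file has at most one Port line (normally the commented-out "#Port 22"). The helper name set_ssh_port is hypothetical, and it only prints the edited file so the result can be reviewed before it replaces the original.

```shell
# set_ssh_port PORT FILE
# Print FILE with its Port directive (commented out or not) set to PORT.
# If FILE has no Port directive at all, append one at the end.
set_ssh_port() (
  port=$1
  file=$2
  if grep -Eq '^[#[:space:]]*Port[[:space:]]' "$file"; then
    sed -E "s/^[#[:space:]]*Port[[:space:]].*/Port $port/" "$file"
  else
    cat "$file"
    printf 'Port %s\n' "$port"
  fi
)
```

One possible use: set_ssh_port 222 /etc/ssh/sshd_config > /tmp/sshd_config.new, check it with sshd -t -f /tmp/sshd_config.new, then copy it into place and restart the ssh service. Keep the existing session open until a login on the new port has been confirmed.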

My initial account had to be lower case (so kd7dk). I then fixed that to my standard HamWAN KD7DK account:

  1. useradd -G hamadmin -m -c "Doug Kingston" KD7DK
  2. cd /home/KD7DK
  3. cp -pr /home/kd7dk/.ssh /home/KD7DK/.ssh
  4. chown -R KD7DK:KD7DK /home/KD7DK/.ssh
  5. chsh -s /bin/bash KD7DK
  6. userdel -r kd7dk

Note: at this point I believe our ansible automation is capable of creating all the other netop accounts.

Add Management LAN interface

Plumb the management LAN to an interface.
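A minimal sketch of what this could look like with ifupdown. The interface name and address are assumptions: ens19 is the interface that carries OSPF instance 2 in the frr.conf below, and 10.44.4.8/24 is inferred from that instance's router-id and network statement. The helper name mgmt_iface_stanza is hypothetical.

```shell
# mgmt_iface_stanza IFACE CIDR
# Print an ifupdown stanza giving IFACE a static address. No gateway line is
# emitted, so no default route is installed on the management side.
mgmt_iface_stanza() {
  printf 'auto %s\n' "$1"
  printf 'iface %s inet static\n' "$1"
  printf '    address %s\n' "$2"
}
```

For example, mgmt_iface_stanza ens19 10.44.4.8/24 prints a three-line stanza that can be appended to /etc/network/interfaces, after which ifup ens19 brings the interface up. Leaving out the gateway is deliberate, since the routes to 10.44.0.0/16 are meant to come from OSPF.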

Install FRR (OSPF Routing)

Add routing for net 10.44.0.0/16 via the mgmt LAN router. We need this to handle routing to both the public and management networks without routing between them, while maintaining some redundancy that we would lose with a default route. The key changes are in frr.conf and daemons, to support running 2 OSPF instances.

# default to using syslog. /etc/rsyslog.d/45-frr.conf places the log in
# /var/log/frr/frr.log
#
# Note:
# FRR's configuration shell, vtysh, dynamically edits the live, in-memory
# configuration while FRR is running. When instructed, vtysh will persist the
# live configuration to this file, overwriting its contents. If you want to
# avoid this, you can edit this file manually before starting FRR, or instruct
# vtysh to write configuration to a different file.
log syslog informational

frr defaults traditional
password <PASSWORD>
enable password <PASSWORD>
log file /var/log/frr/frr.log

interface ens18
 ip ospf 1 area 0
 ip ospf authentication message-digest
 ip ospf message-digest-key 1 md5 <OSPF_PASSWORD>
 ip ospf priority 10

interface ens19
 ip ospf 2 area 0
 ip ospf priority 10

interface lo

router ospf 1
 ospf router-id 44.25.67.58
 redistribute connected
 distribute-list AMPR out connected
 network 44.25.67.0/26 area 0
 network 44.25.0.0/23 area 0
 area 0 authentication message-digest

router ospf 2
 ospf router-id 10.44.4.8
 redistribute connected
 distribute-list MGMT out connected
 network 10.44.4.0/24 area 0
 network 10.44.200.0/23 area 0

access-list AMPR permit 44.0.0.0/9
access-list AMPR permit 44.128.0.0/10
access-list MGMT permit 10.44.0.0/16
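The daemons file has to list the same instance numbers that frr.conf uses. This is a minimal sketch, assuming the FRR packaging reads an ospfd_instances list from /etc/frr/daemons next to ospfd=yes; check that against the installed FRR version. The helper name ospfd_instances_line is hypothetical, and it only derives the line from frr.conf so the two files stay in step.

```shell
# ospfd_instances_line FRR_CONF
# Collect the instance numbers from "router ospf N" lines in FRR_CONF and
# print them as a comma-separated ospfd_instances= setting.
ospfd_instances_line() {
  awk '
    $1 == "router" && $2 == "ospf" && $3 ~ /^[0-9]+$/ {
      list = list sep $3
      sep = ","
    }
    END { print "ospfd_instances=" list }
  ' "$1"
}
```

Run against the configuration above, ospfd_instances_line /etc/frr/frr.conf prints ospfd_instances=1,2. A plain "router ospf" line with no instance number is ignored.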

Install Zabbix

Install git.

apt install git

Install Docker according to https://docs.docker.com/engine/install/debian/

Zabbix is installed from https://github.com/zabbix/zabbix-docker/tree/7.2, following the model docker-compose_v3_ubuntu_mysql_latest.yaml

Add the following to /etc/fstab:

/dev/mapper/monitoring--data--vg-data /data ext4 defaults 0 2

followed by:

systemctl daemon-reload

If necessary, mount /data.

# install mysql (via mariadb fork)
apt install mariadb-server
systemctl stop mariadb
mkdir /data/mysql
chown mysql:mysql /data/mysql

Edit /etc/mysql/mariadb.conf.d/50-server.cnf:

datadir = /data/mysql
innodb_buffer_pool_size = 8G

Add an empty database to mysql, and grant access to 'zabbix' with a password.

mysql << DONE
create database zabbix;
grant all privileges on *.* to 'zabbix'@'localhost' identified by 'some-password';
DONE
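The grant above covers every database on the server. A narrower variant, limited to the zabbix schema, is sketched below. The helper name zabbix_grant_sql is hypothetical, the utf8mb4/utf8mb4_bin settings follow what the Zabbix installation documentation recommends for MySQL/MariaDB, and the host part is an assumption: containers that connect over the Docker network will not appear as 'localhost', so the host may need to match that network instead.

```shell
# zabbix_grant_sql DB USER HOST PASSWORD
# Print SQL that creates DB and a user whose privileges are limited to DB.
# Nothing is executed here; pipe the output to the mysql client to apply it.
zabbix_grant_sql() {
  cat <<EOF
create database if not exists $1 character set utf8mb4 collate utf8mb4_bin;
create user if not exists '$2'@'$3' identified by '$4';
grant all privileges on $1.* to '$2'@'$3';
EOF
}
```

For example: zabbix_grant_sql zabbix zabbix localhost 'some-password' | mysql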

Create a user to run zabbix containers:

useradd -m -c "Zabbix Monitoring" zabbix

apt install tcpdump mg locate
updatedb # (re)build the locate database

# in the container zabbix-server
chmod u+s /usr/bin/fping
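A change made inside a running container lives in that container's writable layer, so the setuid bit is lost whenever the container is recreated (for example by --force-recreate during an upgrade) and has to be reapplied. A minimal sketch of a check for it follows; the helper name fping_status is hypothetical.

```shell
# fping_status FILE
# Report whether FILE carries the set-user-ID bit that fping relies on here.
# Returns non-zero when the bit is missing.
fping_status() {
  if [ -u "$1" ]; then
    echo "setuid ok: $1"
  else
    echo "setuid missing: $1"
    return 1
  fi
}
```

The same test can be run from the host with docker exec CONTAINER test -u /usr/bin/fping, and the bit reapplied with docker exec -u root CONTAINER chmod u+s /usr/bin/fping, where CONTAINER is whatever name docker ps shows for the server container.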

# Install zabbix-agent Debian package (here and on other Linux servers)
apt install zabbix-agent

# then configure /etc/zabbix/zabbix_agentd.conf

Server=172.16.241.0/24
ServerActive=172.16.241.3:10051
Hostname=monitoring.ziply.hamwan.net

This uses the zabbix container addresses.

On any other repeater, these changes would look like:

Server=44.25.67.58
ServerActive=44.25.67.58:10051
Hostname=monitoring.ziply.hamwan.net

Server can be an address, CIDR range, or list of either. See the manual for more details.
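When several servers need the same three settings, they can be generated rather than typed. A minimal sketch with a hypothetical helper name, agent_conf. Note that for active checks the Hostname value has to match the host name configured for that machine in the Zabbix frontend, so it would normally differ from host to host.

```shell
# agent_conf SERVER SERVER_ACTIVE HOSTNAME
# Print the three zabbix_agentd.conf settings described above.
agent_conf() {
  printf 'Server=%s\n' "$1"
  printf 'ServerActive=%s\n' "$2"
  printf 'Hostname=%s\n' "$3"
}
```

For example, agent_conf 44.25.67.58 44.25.67.58:10051 somehost.hamwan.net (somehost is a placeholder) prints the lines to merge into /etc/zabbix/zabbix_agentd.conf, followed by systemctl restart zabbix-agent.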

Disable unnecessary discovery in Mikrotik Template (CapsMAN, LTE)

Starting Zabbix

cd ~zabbix/zabbix-docker
docker compose -f psdr.yaml --profile all up -d

Stopping Zabbix

docker compose --profile all down

Viewing Logs (stdout/stderr)

docker logs -f _container_id_or_name_

Templates

Updated the Mikrotik by SNMP template to enhance dashboards and turn off some data collection (LTE, CAPSman)

rsyslog

Used the configuration from Fremont largely unchanged. There is a key dependency on the Unfiltered.log file, where the bulk of HamWAN infrastructure logs go. Hacking attempts are fed into fail2ban from here. Configuration of client logging needs review, as it had lots of hardcoded 44.24 addresses that were never updated. Need to consider what kind of separation we actually need.

Postfix

Basically out of the box. Will probably need to update to add authentication.

fail2ban

Used new fail2ban.conf as starting point and merged in HamWAN changes. Added new HamWAN files from Fremont with review.

Tom Hayward’s Fail2Ban docker containers

These provide a redis server and the long poll web servers for new bans. https://github.com/kd7lxl/blacklist-service

You also need to add this to 000-default-ssl.conf for apache:

<Location "/blacklist">
    ProxyPass "http://127.0.0.1:1234/"
</Location>

This also needs the enabling of mods proxy and proxy_http:

a2enmod proxy
a2enmod proxy_http

SSH Public Keys

Add /srv/www/keys and this stanza to 000-default-ssl.conf for apache:

Upgrading of Zabbix

Upgrades come in 2 pieces

  1. git pull of updated docker compose configuration files
  2. docker compose pull to get new container contents

Basic steps:

  1. git checkout 7.2 (or whatever your major branch is)
  2. git pull
  3. git checkout PSDR
  4. git merge 7.2
  5. docker compose down <service names>
  6. docker compose up -d --build --force-recreate <service names>
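The two pieces and the basic steps can be put together as a dry run. This is a sketch under assumptions: the helper name zabbix_upgrade_cmds is hypothetical, the compose file and the "all" profile are taken from the Starting Zabbix section, and the whole stack is cycled rather than individual services. It only prints the commands, so they can be reviewed and then run one at a time from ~zabbix/zabbix-docker.

```shell
# zabbix_upgrade_cmds UPSTREAM_BRANCH LOCAL_BRANCH COMPOSE_FILE
# Print the upgrade commands in order, without running any of them.
zabbix_upgrade_cmds() {
  cat <<EOF
git checkout $1
git pull
git checkout $2
git merge $1
docker compose -f $3 --profile all pull
docker compose -f $3 --profile all down
docker compose -f $3 --profile all up -d --build --force-recreate
EOF
}
```

For example: zabbix_upgrade_cmds 7.2 PSDR psdr.yaml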

TODO


Pending issues:

ICMP ping loss trigger is too sensitive. Needs a longer sample interval I believe.

redis is complaining at startup:
WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. Being disabled, it can also cause failures without low memory condition, see https://github.com/jemalloc/jemalloc/issues/1328. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.

S2.CapitolPark.hamwan.net MikroTik: Interface wlan1(): Link down
Caused because a sector that loses its last client connection will change OperStatus to down instead of dormant. This trigger may need to be revised for APs, or possibly deleted for APs.

zabbix-server-1 | 69:20250205:143203.601 cannot send list of active checks to "172.16.241.1": host [monitoring] not found.
This is a configuration issue I believe.

Figure out how to ensure all the proxmox servers have the same inventory of install images.
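For the redis memory overcommit warning listed above, a small check makes it easy to confirm whether the setting has taken effect. A minimal sketch; the helper name overcommit_enabled is hypothetical, and the optional file argument exists only so the check can be exercised against a test file.

```shell
# overcommit_enabled [FILE]
# Succeed if the overcommit mode read from FILE is 1.
# FILE defaults to the live kernel value under /proc.
overcommit_enabled() {
  [ "$(cat "${1:-/proc/sys/vm/overcommit_memory}")" = "1" ]
}
```

To make the setting persistent, the line vm.overcommit_memory = 1 can go in /etc/sysctl.conf as the warning suggests, or in a drop-in file under /etc/sysctl.d/ (the file name is a free choice), followed by sysctl --system.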

Resolved issues:

Web GUI shows a valid certificate but complains about active content with certificate errors. Something like this posting: https://community.letsencrypt.org/t/chrome-69-0-3497-81-reports-active-content-with-certificate-errors/71545
Resolution: This was caused by cached javascript content from the prior self-signed certificate. Cleared the cached content and all was fine.

Ping health checks failing. Needed to add setuid bit to fping in the server container.

SNMP agent item "net.if.wireless.walk" on host "r1.capitolpark.hamwan.net" failed: first network error, wait for 15 seconds (and similar)
Possibly: https://www.zabbix.com/forum/zabbix-troubleshooting-and-problems/483095-zabbix-7-snmp-timeouts
Fixed this by increasing the timeout for SNMP queries (Administration > General > Timeouts). I changed it from 3s to 10s.

3 Mikrotik hosts were refusing to respond to SNMP get/getbulk, in this case s2.indianola, r3.baldi and capitolpark.queenanne. This turned out to be a semi-known issue with SNMP on Mikrotik and asymmetric routes. RouterOS would respond from the address of whichever interface had the best route back to the requester. If that was not the address the request was sent to, the requester would be unable to match the response with the request it sent. The solution is to give all devices a stable address (on their loopback interface if they have more than one interface), make that the default address in the portal, and set src-address in /snmp to force all SNMP responses to come from that address.
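The SNMP source address fix from the last item can be generated per device. A minimal sketch: the helper name routeros_snmp_src_cmds is hypothetical, and so are the example address and the loopback interface name, which is assumed to already exist on the router. It prints RouterOS commands to paste into the router's terminal and does not connect to anything.

```shell
# routeros_snmp_src_cmds ADDRESS INTERFACE
# Print RouterOS commands that put ADDRESS on INTERFACE as a /32 and make
# SNMP replies originate from that address.
routeros_snmp_src_cmds() {
  printf '/ip address add address=%s/32 interface=%s\n' "$1" "$2"
  printf '/snmp set src-address=%s\n' "$1"
}
```

For example, routeros_snmp_src_cmds 192.0.2.1 loopback0 (a documentation address standing in for the device's stable address). Afterwards /snmp print on the router shows the configured src-address.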

References:

Mikrotik templates (from Zabbix and third parties) https://www.zabbix.com/integrations/mikrotik

https://www.zabbix.com/documentation/current/en/manual/discovery/network_discovery